3. Exploratory Data Analysis

Exploratory data analysis (EDA) involves taking a first look at a dataset and summarising its salient characteristics using tables and graphics. It is (or should be) the stage before hypothesis testing, and it can help inform the hypotheses you go on to test. In this chapter, we’ll look at a few options for EDA using code.

To show how to do EDA using code, we will need a dataset to explore. We’ll use the Grinnell housing dataset, which covers houses sold between 2005 and 2015 in Grinnell, Iowa. It was assembled by a local estate agent, Matt Karjalahti.

First, let’s import a couple of essential packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from itertools import islice

# Plot settings
plt.style.use("https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt")

3.1. Preliminaries: extracting, typing, and initial cleaning

Let’s grab some data to do exploratory data analysis on.

df = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/GrinnellHouses.csv", index_col=0)
df.head()
Date Address Bedrooms Baths SquareFeet LotSize YearBuilt YearSold MonthSold DaySold CostPerSqFt OrigPrice ListPrice SalePrice SPLPPct
1 16695 1510 First Ave #112 2 1.0 1120.0 NaN 1993 2005 9 16 6.25 17000 10500 7000 66.67
2 16880 1020 Center St 3 1.0 1224.0 0.172176 1900 2006 3 20 22.06 35000 35000 27000 77.14
3 16875 918 Chatterton St 4 1.0 1540.0 NaN 1970 2006 3 15 18.18 54000 47000 28000 59.57
4 16833 1023 & 1025 Spring St. 3 1.0 1154.0 NaN 1900 2006 2 1 26.00 65000 49000 30000 61.22
5 16667 503 2nd Ave 3 1.0 1277.0 0.206612 1900 2005 8 19 24.08 35000 35000 30750 87.86

There are some things we should do even before an exploratory data analysis; above all, we should make sure we’ve cast the columns to the right types. Even a quick look at the data provided by .head() shows that the Date column is in a weird format, the number of Baths is a floating point number rather than an integer (is it possible to have half a bathroom? Maybe, but it doesn’t sound very private), and there are some NaNs in there too. It’s not clear what the fractional values of bathrooms mean (including from the documentation), so we’ll just have to take care with that variable.

Fortunately, there is guidance attached to the dataset as to what the Date column really means: it’s a counter of days encoding the date of sale, with 1st Jan 2005 encoded as 16436. The documents also tell us that the final year is 2015. We’d really like a more usable datetime column than this, so let’s sort it out first. Ideally, we’d map the Date column into a new datetime column that turns 16436 into 1st Jan 2005, 16437 into 2nd Jan 2005, and so on, until we have a mapping for every date between the first and last dates in the dataset.

Our solution to create this mapping will involve a couple of steps. First, we’ll use pd.date_range to get all of the dates from 1st Jan 2005 for a given number of periods. We’ll determine the number of periods from the maximum value of the Date column (the last date represented) minus the start counter, which we know is 16436. To create a mapping from the coded numbers to the dates, we’ll use the zip function, which pairs the elements of two iterables together, and the dict function, so that we get key-value pairs that we can conveniently feed into a dataframe.

Here’s the first step, creating a dictionary mapping:

start_code = 16436
end_code = df['Date'].max() + 1 # +1 because of how ranges are computed; we want to *include* the last date

datetime_dict = dict(zip(range(start_code, end_code),
                               pd.date_range(start='2005/01/01', periods=end_code-start_code)))

# Look at first few dict entries:
list(islice(datetime_dict.items(), 5))
[(16436, Timestamp('2005-01-01 00:00:00', freq='D')),
 (16437, Timestamp('2005-01-02 00:00:00', freq='D')),
 (16438, Timestamp('2005-01-03 00:00:00', freq='D')),
 (16439, Timestamp('2005-01-04 00:00:00', freq='D')),
 (16440, Timestamp('2005-01-05 00:00:00', freq='D'))]

Alright, now we want to use this new datetime format. We can use map to apply the dictionary to every value in the column:

df['datetime'] = df['Date'].map(datetime_dict)
# Check the biggest rows by datetime
df.nlargest(5, columns=['datetime'])
Date Address Bedrooms Baths SquareFeet LotSize YearBuilt YearSold MonthSold DaySold CostPerSqFt OrigPrice ListPrice SalePrice SPLPPct datetime
915 20186 957 350th Ave 3 2.00 1700.0 11.430000 1966 2015 4 8 128.82 229000 229000 219000 95.63 2015-04-09
886 20185 232 4th Ave W 3 2.50 NaN 0.139991 2004 2015 4 7 0.00 137500 137500 128500 93.45 2015-04-08
905 20185 1726 Summer St 4 1.75 1512.0 0.530000 1996 2015 4 7 125.00 239900 219900 189000 85.95 2015-04-08
920 20184 108 East St 4 3.00 1650.0 0.320000 2013 2015 4 6 156.97 259000 259000 259000 100.00 2015-04-07
834 20181 1405 Prince St 3 1.50 1094.0 NaN 1900 2015 4 3 18.28 32500 32500 20000 61.54 2015-04-04

An extra column containing datetime has been added and, looking at the max rows, we can see that it does indeed run all the way to 2015 as expected from the documentation of the dataset.
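As an aside, a vectorised alternative to the dictionary mapping is possible with pd.to_datetime: shift the coded values so that 16436 becomes day zero, then count days forward from an origin of 1st Jan 2005. This is a sketch on illustrative values rather than the full dataset:

```python
import pandas as pd

# Vectorised alternative to the dictionary mapping: subtract the start code so
# that 16436 maps to day zero, then offset from the 1st Jan 2005 origin
codes = pd.Series([16436, 16437, 20186])  # illustrative values only
datetimes = pd.to_datetime(codes - 16436, unit="D", origin=pd.Timestamp("2005-01-01"))
print(datetimes.dt.date.tolist())
```

Applied to the real data, pd.to_datetime(df['Date'] - 16436, unit='D', origin=pd.Timestamp('2005-01-01')) should give the same datetime column as the dictionary approach.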

Okay, now we want to sort out the data type issues we saw earlier. But let’s just check they’re as bad as we think using df.info():

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 929 entries, 1 to 929
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         929 non-null    int64         
 1   Address      929 non-null    object        
 2   Bedrooms     929 non-null    int64         
 3   Baths        929 non-null    float64       
 4   SquareFeet   911 non-null    float64       
 5   LotSize      741 non-null    float64       
 6   YearBuilt    929 non-null    int64         
 7   YearSold     929 non-null    int64         
 8   MonthSold    929 non-null    int64         
 9   DaySold      929 non-null    int64         
 10  CostPerSqFt  929 non-null    float64       
 11  OrigPrice    929 non-null    int64         
 12  ListPrice    929 non-null    int64         
 13  SalePrice    929 non-null    int64         
 14  SPLPPct      929 non-null    float64       
 15  datetime     929 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(5), int64(9), object(1)
memory usage: 123.4+ KB

pandas read most of them in sensibly, but not all of them, so let’s adjust the ones we need to:

df = df.assign(Address = df['Address'].astype('string'),
               Bedrooms = df['Bedrooms'].astype('category'),
               OrigPrice=df['OrigPrice'].astype(float),
               ListPrice=df['ListPrice'].astype(float),
               SalePrice=df['SalePrice'].astype(float))
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 929 entries, 1 to 929
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         929 non-null    int64         
 1   Address      929 non-null    string        
 2   Bedrooms     929 non-null    category      
 3   Baths        929 non-null    float64       
 4   SquareFeet   911 non-null    float64       
 5   LotSize      741 non-null    float64       
 6   YearBuilt    929 non-null    int64         
 7   YearSold     929 non-null    int64         
 8   MonthSold    929 non-null    int64         
 9   DaySold      929 non-null    int64         
 10  CostPerSqFt  929 non-null    float64       
 11  OrigPrice    929 non-null    float64       
 12  ListPrice    929 non-null    float64       
 13  SalePrice    929 non-null    float64       
 14  SPLPPct      929 non-null    float64       
 15  datetime     929 non-null    datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(8), int64(5), string(1)
memory usage: 117.4 KB

Having to do some initial variable type cleaning is a normal and unavoidable part of data analysis, especially when reading in from a format like CSV (which has great interoperability across systems but does not preserve data types). What we’ve seen here is a typical pattern.
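When you already know what the types should be, you can also declare them at read time rather than fixing them after the fact. Here’s a sketch on an inline stand-in for the real CSV, with a dtype mapping that mirrors the conversions we made above:

```python
import io

import pandas as pd

# Declare column types at read time; the inline CSV text stands in for the
# real file, and the dtype mapping mirrors the conversions above
csv_text = "Address,Bedrooms,SalePrice\n1020 Center St,3,27000\n918 Chatterton St,4,28000\n"
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Address": "string", "Bedrooms": "category", "SalePrice": "float"},
)
print(df_typed.dtypes)
```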

There’s just one more step before starting the EDA proper. If you haven’t reached the Chapter on best practice, which talks about naming conventions, you won’t know this yet but the usual naming convention for variables and columns in Python is so-called snake case (appropriately enough). An example of a word not in snake case would be ‘ThisIsAColName’; actually that’s in Camel Case. Rendered in snake case, it would be ‘this_is_a_col_name’, which is longer but slightly easier to read. So we should really change the column names to be in snake case. Fortunately, the dataprep package has us covered here; it provides methods to translate between pretty much every naming convention that exists. Let’s change our columns to snake case, which is the default setting of clean_headers.

(Note: the example below may require an updated version of the dataprep package.)

from dataprep.clean import clean_headers

df = clean_headers(df)
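If you’d rather not add a dependency, a small regex-based helper covers the common camel-case patterns too. This is a sketch with a hypothetical to_snake_case function (it won’t handle every naming convention):

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a CamelCase/PascalCase column name to snake_case."""
    # Break before a capital that follows a lower-case letter or digit...
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # ...and before the final capital in a run of capitals followed by lower case
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", s)
    return s.lower()


print(to_snake_case("ThisIsAColName"))  # this_is_a_col_name
```

To rename every column at once, you could then use df = df.rename(columns=to_snake_case).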

Now we can start the exploratory data analysis!

3.2. EDA using pandas built-in methods

pandas has some great options for built-in EDA; in fact, we’ve already seen one of them, df.info(), which, as well as reporting datatypes and memory usage, tells us how many observations in each column have non-null (ie non-missing) values.

3.2.1. Exploratory tables and descriptive statistics

A small step beyond df.info() to get tables is to use df.describe() which, for the numeric columns, reports some basic summary statistics:

df.describe()
Date Baths SquareFeet LotSize YearBuilt YearSold MonthSold DaySold CostPerSqFt OrigPrice ListPrice SalePrice SPLPPct
count 929.000000 929.000000 911.000000 741.000000 929.000000 929.000000 929.000000 929.000000 929.000000 929.000000 929.000000 929.000000 929.000000
mean 18223.215285 1.779333 1583.215148 0.723463 1945.903122 2009.365985 6.831001 16.226050 83.291561 146047.083961 141399.138859 133203.720129 93.563994
std 1139.171002 0.743154 655.664966 2.725697 37.209902 3.124149 3.035790 9.396815 35.762241 85174.109372 82228.201763 77959.624697 6.953764
min 16527.000000 0.000000 640.000000 0.028926 1870.000000 2005.000000 1.000000 1.000000 0.000000 5990.000000 10500.000000 7000.000000 46.500000
25% 17169.000000 1.000000 1150.000000 0.233884 1900.000000 2007.000000 5.000000 8.000000 60.820000 89900.000000 88000.000000 83000.000000 91.320000
50% 18046.000000 1.750000 1440.000000 0.284091 1956.000000 2009.000000 7.000000 16.000000 82.880000 129900.000000 125000.000000 119340.000000 94.920000
75% 19281.000000 2.000000 1833.000000 0.370179 1973.000000 2012.000000 9.000000 25.000000 105.770000 179000.000000 172000.000000 162500.000000 97.560000
max 20186.000000 6.000000 6815.000000 55.000000 2013.000000 2015.000000 12.000000 31.000000 258.790000 695000.000000 695000.000000 606000.000000 111.020000

Although helpful, that sure is hard to read! We can improve this by using the round() method too:

sum_table = df.describe().round(1)
sum_table
Date Baths SquareFeet LotSize YearBuilt YearSold MonthSold DaySold CostPerSqFt OrigPrice ListPrice SalePrice SPLPPct
count 929.0 929.0 911.0 741.0 929.0 929.0 929.0 929.0 929.0 929.0 929.0 929.0 929.0
mean 18223.2 1.8 1583.2 0.7 1945.9 2009.4 6.8 16.2 83.3 146047.1 141399.1 133203.7 93.6
std 1139.2 0.7 655.7 2.7 37.2 3.1 3.0 9.4 35.8 85174.1 82228.2 77959.6 7.0
min 16527.0 0.0 640.0 0.0 1870.0 2005.0 1.0 1.0 0.0 5990.0 10500.0 7000.0 46.5
25% 17169.0 1.0 1150.0 0.2 1900.0 2007.0 5.0 8.0 60.8 89900.0 88000.0 83000.0 91.3
50% 18046.0 1.8 1440.0 0.3 1956.0 2009.0 7.0 16.0 82.9 129900.0 125000.0 119340.0 94.9
75% 19281.0 2.0 1833.0 0.4 1973.0 2012.0 9.0 25.0 105.8 179000.0 172000.0 162500.0 97.6
max 20186.0 6.0 6815.0 55.0 2013.0 2015.0 12.0 31.0 258.8 695000.0 695000.0 606000.0 111.0

Now, although this is very basic, let’s say you wanted to include it as a table of summary statistics in a paper. This is just a dataframe so you can export it just like you can any pandas dataframe. This includes options like .to_excel. For inclusion in a paper or report, you’re most likely to want to report it as text:

print(sum_table.to_string())
          Date  Baths  SquareFeet  LotSize  YearBuilt  YearSold  MonthSold  DaySold  CostPerSqFt  OrigPrice  ListPrice  SalePrice  SPLPPct
count    929.0  929.0       911.0    741.0      929.0     929.0      929.0    929.0        929.0      929.0      929.0      929.0    929.0
mean   18223.2    1.8      1583.2      0.7     1945.9    2009.4        6.8     16.2         83.3   146047.1   141399.1   133203.7     93.6
std     1139.2    0.7       655.7      2.7       37.2       3.1        3.0      9.4         35.8    85174.1    82228.2    77959.6      7.0
min    16527.0    0.0       640.0      0.0     1870.0    2005.0        1.0      1.0          0.0     5990.0    10500.0     7000.0     46.5
25%    17169.0    1.0      1150.0      0.2     1900.0    2007.0        5.0      8.0         60.8    89900.0    88000.0    83000.0     91.3
50%    18046.0    1.8      1440.0      0.3     1956.0    2009.0        7.0     16.0         82.9   129900.0   125000.0   119340.0     94.9
75%    19281.0    2.0      1833.0      0.4     1973.0    2012.0        9.0     25.0        105.8   179000.0   172000.0   162500.0     97.6
max    20186.0    6.0      6815.0     55.0     2013.0    2015.0       12.0     31.0        258.8   695000.0   695000.0   606000.0    111.0

as Markdown:

print(sum_table.to_markdown())
|       |    Date |   Baths |   SquareFeet |   LotSize |   YearBuilt |   YearSold |   MonthSold |   DaySold |   CostPerSqFt |   OrigPrice |   ListPrice |   SalePrice |   SPLPPct |
|:------|--------:|--------:|-------------:|----------:|------------:|-----------:|------------:|----------:|--------------:|------------:|------------:|------------:|----------:|
| count |   929   |   929   |        911   |     741   |       929   |      929   |       929   |     929   |         929   |       929   |       929   |       929   |     929   |
| mean  | 18223.2 |     1.8 |       1583.2 |       0.7 |      1945.9 |     2009.4 |         6.8 |      16.2 |          83.3 |    146047   |    141399   |    133204   |      93.6 |
| std   |  1139.2 |     0.7 |        655.7 |       2.7 |        37.2 |        3.1 |         3   |       9.4 |          35.8 |     85174.1 |     82228.2 |     77959.6 |       7   |
| min   | 16527   |     0   |        640   |       0   |      1870   |     2005   |         1   |       1   |           0   |      5990   |     10500   |      7000   |      46.5 |
| 25%   | 17169   |     1   |       1150   |       0.2 |      1900   |     2007   |         5   |       8   |          60.8 |     89900   |     88000   |     83000   |      91.3 |
| 50%   | 18046   |     1.8 |       1440   |       0.3 |      1956   |     2009   |         7   |      16   |          82.9 |    129900   |    125000   |    119340   |      94.9 |
| 75%   | 19281   |     2   |       1833   |       0.4 |      1973   |     2012   |         9   |      25   |         105.8 |    179000   |    172000   |    162500   |      97.6 |
| max   | 20186   |     6   |       6815   |      55   |      2013   |     2015   |        12   |      31   |         258.8 |    695000   |    695000   |    606000   |     111   |

as HTML:

print(sum_table.head(3).to_html())
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Date</th>
      <th>Baths</th>
      <th>SquareFeet</th>
      <th>LotSize</th>
      <th>YearBuilt</th>
      <th>YearSold</th>
      <th>MonthSold</th>
      <th>DaySold</th>
      <th>CostPerSqFt</th>
      <th>OrigPrice</th>
      <th>ListPrice</th>
      <th>SalePrice</th>
      <th>SPLPPct</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>count</th>
      <td>929.0</td>
      <td>929.0</td>
      <td>911.0</td>
      <td>741.0</td>
      <td>929.0</td>
      <td>929.0</td>
      <td>929.0</td>
      <td>929.0</td>
      <td>929.0</td>
      <td>929.0</td>
      <td>929.0</td>
      <td>929.0</td>
      <td>929.0</td>
    </tr>
    <tr>
      <th>mean</th>
      <td>18223.2</td>
      <td>1.8</td>
      <td>1583.2</td>
      <td>0.7</td>
      <td>1945.9</td>
      <td>2009.4</td>
      <td>6.8</td>
      <td>16.2</td>
      <td>83.3</td>
      <td>146047.1</td>
      <td>141399.1</td>
      <td>133203.7</td>
      <td>93.6</td>
    </tr>
    <tr>
      <th>std</th>
      <td>1139.2</td>
      <td>0.7</td>
      <td>655.7</td>
      <td>2.7</td>
      <td>37.2</td>
      <td>3.1</td>
      <td>3.0</td>
      <td>9.4</td>
      <td>35.8</td>
      <td>85174.1</td>
      <td>82228.2</td>
      <td>77959.6</td>
      <td>7.0</td>
    </tr>
  </tbody>
</table>

or as latex:

print(sum_table.to_latex(caption='Summary stats from EDA'))
\begin{table}
\centering
\caption{Summary stats from EDA}
\begin{tabular}{lrrrrrrrrrrrrr}
\toprule
{} &     Date &  Baths &  SquareFeet &  LotSize &  YearBuilt &  YearSold &  MonthSold &  DaySold &  CostPerSqFt &  OrigPrice &  ListPrice &  SalePrice &  SPLPPct \\
\midrule
count &    929.0 &  929.0 &       911.0 &    741.0 &      929.0 &     929.0 &      929.0 &    929.0 &        929.0 &      929.0 &      929.0 &      929.0 &    929.0 \\
mean  &  18223.2 &    1.8 &      1583.2 &      0.7 &     1945.9 &    2009.4 &        6.8 &     16.2 &         83.3 &   146047.1 &   141399.1 &   133203.7 &     93.6 \\
std   &   1139.2 &    0.7 &       655.7 &      2.7 &       37.2 &       3.1 &        3.0 &      9.4 &         35.8 &    85174.1 &    82228.2 &    77959.6 &      7.0 \\
min   &  16527.0 &    0.0 &       640.0 &      0.0 &     1870.0 &    2005.0 &        1.0 &      1.0 &          0.0 &     5990.0 &    10500.0 &     7000.0 &     46.5 \\
25\%   &  17169.0 &    1.0 &      1150.0 &      0.2 &     1900.0 &    2007.0 &        5.0 &      8.0 &         60.8 &    89900.0 &    88000.0 &    83000.0 &     91.3 \\
50\%   &  18046.0 &    1.8 &      1440.0 &      0.3 &     1956.0 &    2009.0 &        7.0 &     16.0 &         82.9 &   129900.0 &   125000.0 &   119340.0 &     94.9 \\
75\%   &  19281.0 &    2.0 &      1833.0 &      0.4 &     1973.0 &    2012.0 &        9.0 &     25.0 &        105.8 &   179000.0 &   172000.0 &   162500.0 &     97.6 \\
max   &  20186.0 &    6.0 &      6815.0 &     55.0 &     2013.0 &    2015.0 &       12.0 &     31.0 &        258.8 &   695000.0 &   695000.0 &   606000.0 &    111.0 \\
\bottomrule
\end{tabular}
\end{table}

And remember, with all of these, you can write them to file using the following structure:


with open('sum_stats_table.tex', 'w') as f:
    f.write(sum_table.to_latex(caption='Summary stats from EDA'))

Of course, the stats provided in this pre-built table are not very customised. So what do we do to get the table that we actually want? Well, the answer is to draw on the contents of the previous data chapters, particularly the introduction to data analysis. Groupbys, merges, aggregations: use all of them to produce the EDA table that you want.
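For instance, a customised summary table can be built with a groupby and named aggregations. Here’s a sketch on a toy dataframe that stands in for the housing data used in this chapter:

```python
import pandas as pd

# A customised summary table via groupby and named aggregations; the toy
# dataframe stands in for the housing data used in this chapter
toy = pd.DataFrame(
    {
        "YearSold": [2005, 2005, 2006, 2006, 2006],
        "SalePrice": [7000.0, 30750.0, 27000.0, 28000.0, 30000.0],
    }
)
summary = toy.groupby("YearSold").agg(
    median_price=("SalePrice", "median"), n_sales=("SalePrice", "count")
)
print(summary)
```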

If you’re exploring data, you might also want to be able to read everything clearly and see any deviations from what you’d expect quickly. pandas has some built-in functionality that styles dataframes to help you. These styles persist when you export the dataframe to, say, Excel, too.

Here’s an example that highlights some ways of styling dataframes, making use of several features: changing the units (via a lambda function), unstacking into a wider format (unstack), filling NaNs with unobtrusive strings (.fillna('-')), removing digits after the decimal place (.style.set_precision(0)), and adding a caption (.style.set_caption).

(df.groupby(['YearSold', 'Bedrooms'])
   .mean()['SalePrice']
   .apply(lambda x: x/1e3)
   .unstack()
   .fillna('-')
   .style
   .set_precision(0)
   .set_caption('Sale price (thousands)')
)
Sale price (thousands)
Bedrooms 0 1 2 3 4 5 6 7 8
YearSold
2005 155 - 82 112 141 174 205 81 -
2006 25 - 80 125 139 281 - - 205
2007 - - 83 110 171 245 400 - -
2008 238 62 82 132 158 236 221 - -
2009 - 84 78 104 129 159 - - -
2010 - - 68 120 161 195 - - -
2011 128 - 83 128 176 213 - - -
2012 191 38 98 132 154 213 305 - -
2013 382 88 77 129 197 182 141 - -
2014 55 65 94 147 157 198 - - -
2015 - - 93 126 213 - - - -

Although neater than the tables we’ve seen so far, this is still a drab grid of numbers. The eye is not immediately drawn to it!

To remedy that, let’s take a look at another styling technique: the use of colour. Let’s say we wanted to make a table that showed a cross-tabulation between year and number of bathrooms. Naturally, we’ll use pd.crosstab but we’ll ask that the values that appear in the table (counts) be lit up with a heatmap:

pd.crosstab(df['Bedrooms'], df['YearSold']).style.background_gradient(cmap='plasma')
YearSold 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
Bedrooms
0 1 1 0 1 0 0 3 3 1 2 0
1 0 0 0 2 1 0 0 1 2 1 0
2 15 18 24 17 14 10 10 11 18 11 1
3 54 64 56 59 25 27 31 47 39 46 15
4 21 39 24 17 10 14 14 28 28 30 3
5 7 8 6 8 3 2 8 6 4 9 0
6 1 0 1 2 0 0 0 1 2 0 0
7 1 0 0 0 0 0 0 0 0 0 0
8 0 1 0 0 0 0 0 0 0 0 0

Here are a couple of other styling tips for dataframes.

First, use bars to show ordering:

(df
 .iloc[:10, -6:-1]
 .style
 .set_precision(0)
 .bar(subset=['CostPerSqFt', 'SalePrice'], color='#d65f5f')
)
CostPerSqFt OrigPrice ListPrice SalePrice SPLPPct
1 6 17000 10500 7000 67
2 22 35000 35000 27000 77
3 18 54000 47000 28000 60
4 26 65000 49000 30000 61
5 24 35000 35000 30750 88
6 39 45900 45900 42000 92
7 60 62500 56500 46000 81
8 43 54500 52500 49500 94
9 55 59900 52500 50000 95
10 34 50000 50000 50000 100

Second, use format to add informative suffixes and round numbers appropriately:

df.groupby(['Bedrooms'])[['SPLPPct']].mean().style.format("{:.0f}%")
SPLPPct
Bedrooms
0 97%
1 89%
2 92%
3 94%
4 94%
5 93%
6 90%
7 95%
8 82%

Third, use .highlight_max, and similar commands, to show important entries:

df.iloc[:10, -6:-1].style.highlight_max().format("{:.0f}")
CostPerSqFt OrigPrice ListPrice SalePrice SPLPPct
1 6 17000 10500 7000 67
2 22 35000 35000 27000 77
3 18 54000 47000 28000 60
4 26 65000 49000 30000 61
5 24 35000 35000 30750 88
6 39 45900 45900 42000 92
7 60 62500 56500 46000 81
8 43 54500 52500 49500 94
9 55 59900 52500 50000 95
10 34 50000 50000 50000 100

You can find a full set of styling commands here.

3.2.2. Exploratory plotting

pandas has some built-in plotting options to help you look at data quickly. These can be accessed via .plot.* or .plot(), depending on the context. Let’s make a quick .plot() of the mean quarterly nominal Sale Price over time (with extra options passed via keyword arguments):

(df
 .set_index('datetime')
 .groupby(pd.Grouper(freq='3M'))['SalePrice'].mean()
 .apply(lambda x: x/1e3)
 .plot(title='House sales in Grinnell, Iowa', xlabel='', ylabel='Mean nominal sale price (000s USD)')
);
[Figure: line plot of mean quarterly nominal sale price, titled 'House sales in Grinnell, Iowa']

Again, if you can get the data in the right shape, you can plot it. The same function works with multiple lines:

(df
 .set_index('datetime')
 .groupby(pd.Grouper(freq='3M'))[['OrigPrice', 'ListPrice', 'SalePrice']].mean()
 .apply(lambda x: x/1e3)
 .plot(style=['-', ':', '-.'])
);
[Figure: line plot of mean quarterly OrigPrice, ListPrice, and SalePrice]

Now let’s see some of the other quick .plot.* options.

Here’s a KDE plot. Just to show what’s possible, we’ll use the df.columnname syntax, an alternative to df['columnname'], and setting limits via keyword arguments.

df.SquareFeet.plot.kde(ylim=(0, None), xlim=(0, None));
[Figure: kernel density estimate of SquareFeet]

A bar chart (use barh for horizontal orientation; rot sets rotation of labels):

df.value_counts('Bedrooms').sort_index().plot.bar(title='Counts', rot=0);
[Figure: bar chart of counts by number of bedrooms, titled 'Counts']

This chart shows why EDA is so important: if we are modelling residential housing, and we find that there are observations that don’t have any bedrooms, perhaps they should be excluded from the downstream analysis.
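If we did decideter to exclude them (a modelling judgement on our part, not a rule from the dataset documentation), the filter is a one-liner. Note that Bedrooms is a category dtype in our dataframe, so we compare on the underlying integer values; the toy frame here stands in for df:

```python
import pandas as pd

# Exclude zero-bedroom observations; Bedrooms is a category dtype, so compare
# the underlying integer values. The toy dataframe stands in for df.
toy = pd.DataFrame(
    {"Bedrooms": pd.Categorical([0, 2, 3, 0, 4]), "SalePrice": [10, 20, 30, 40, 50]}
)
with_beds = toy.loc[toy["Bedrooms"].astype(int) > 0]
print(len(with_beds))  # 3
```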

This next one, using .plot.hist, provides another instructive example of how EDA can reveal interesting issues with datasets. First, there are many more houses built in 1900 than seems plausible; checking the documentation for the dataset, it does say that when a construction date is not known, it is entered as 1900, which explains what’s going on here. Second, some of the construction years are in the future! Is this plausible? It could be: some houses are sold before the construction of an estate is completed, and that could be the case here.

df['YearBuilt'].plot.hist(bins=30, title='Year of construction');
[Figure: histogram of YearBuilt, titled 'Year of construction']
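One possible response, if it suits the downstream analysis, is to treat the 1900 sentinel as missing. This is an assumption on our part (some 1900 builds will be genuine), sketched here on a toy series that stands in for df['YearBuilt']:

```python
import numpy as np
import pandas as pd

# Treat the 1900 'unknown construction date' sentinel as missing; the toy
# series stands in for df['YearBuilt']. An assumption: some 1900s are genuine.
year_built = pd.Series([1900, 1956, 1900, 1973, 2013])
year_built_clean = year_built.replace(1900, np.nan)
print(int(year_built_clean.isna().sum()))  # 2
```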

Boxplot:

(df
 [['OrigPrice', 'ListPrice', 'SalePrice']]
 .plot
 .box()
);
[Figure: box plots of OrigPrice, ListPrice, and SalePrice]

And here’s an area chart, counting sales by (log) price class over time:

df['class_ln_price'] = pd.cut(np.log(df['OrigPrice']), bins=4, precision=0)

(df
 .set_index('datetime')
 .groupby([pd.Grouper(freq='A'), 'class_ln_price'])['class_ln_price']
 .count()
 .unstack()
 .plot.area(ylabel='Count'));
[Figure: area chart of sale counts by log-price class over time]

Scatter plot:

df.plot.scatter(x='SquareFeet', y='SalePrice', alpha=0.7, ylim=(0, None));
[Figure: scatter plot of SalePrice against SquareFeet]

Hexbin:

df.plot.hexbin(y='ListPrice', x='SPLPPct', gridsize=10,
               cmap='inferno_r', C='CostPerSqFt', sharex=False);
[Figure: hexbin plot of ListPrice against SPLPPct, coloured by CostPerSqFt]

3.3. The pandas-profiling package

The EDA we did using the built-in pandas functions was a bit limited and user-input heavy. The pandas-profiling library aims to automate the legwork of EDA for you. It generates ‘profile’ reports from a pandas DataFrame. For each column, many statistics are computed and then relayed in an interactive HTML report.

Let’s generate a report on our dataset using the minimal=True setting (the default settings produce a lot of computationally expensive extras):

from pandas_profiling import ProfileReport


profile = ProfileReport(df,
                        minimal=True,
                        title="Profiling Report: Grinnell House Sales")
profile.to_notebook_iframe()

This is a full-on report about everything in our dataset! We can see, for instance, that we have 17 numerical variables, 0 boolean, 4 categorical (which includes string), and 1 datetime. The overview also tells us the number of observations, that 1.2% of values are missing across all columns, and that we have zero duplicates.

The warnings page shows where pandas-profiling really shines. It flags potential issues with the data that should be taken into account in any subsequent analysis. For example, although not relevant here, the report will say if there are very unbalanced classes in a low cardinality categorical variable.

Let’s explore a couple of the warnings that have been raised. Address has high cardinality; but that’s okay, we expect most addresses to be different because the turn-over of houses is not that frequent. We also see that there are a lot of missing values for LotSize.

Now let’s dig down into a specific variable, SalePrice. You can view more about it using the ‘Toggle details’ button. We get four extra pages of detail about this variable, in addition to the headline statistics: a comprehensive page of descriptive statistics, a page with a histogram of the variable, a page with a table of common values, and a page of extreme values.

3.4. The dataprep package

The dataprep package offers very similar functionality to pandas-profiling: it produces an in-depth report on the input data. There isn’t much difference between them in general, though dataprep seems to have slightly better support for string column types (check out the details of Address below) and is a little richer on visualisations (it has interactive plots whose axes you can change; look at the interactions page below).

Do note, however, that some column types can create an error, which is why we’ll pass in all but the last column of our dataframe from the previous set of examples; in this case, the binned price-category column seems to be the issue.

from dataprep.eda import create_report

report = create_report(df.iloc[:, :-1])
/Users/arthurturrell/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/dask/dataframe/accessor.py:48: FutureWarning: The default value of regex will change from True to False in a future version.
  out = getattr(getattr(obj, accessor, obj), attr)(*args, **kwargs)
from IPython.display import display


display(report)
DataPrep Report

Overview

Dataset Statistics

Number of Variables 16
Number of Rows 929
Missing Cells 206
Missing Cells (%) 1.4%
Duplicate Rows 10
Duplicate Rows (%) 1.1%
Total Size in Memory 206.1 KB
Average Row Size in Memory 227.2 B

Variable Types

Numerical 13
Categorical 2
DateTime 1

Variables

Date

numerical

Distinct Count 709
Unique (%) 76.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 14.5 KB
Mean 18223.2153
Minimum 16527
Maximum 20186
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 16527
5-th Percentile 16657.6
Q1 17169
Median 18046
Q3 19281
95-th Percentile 19990.4
Maximum 20186
Range 3659
IQR 2112

Descriptive Statistics

Mean 18223.2153
For each variable, the report gives a detailed breakdown: its inferred type, distinct and missing counts, quantile statistics (minimum, quartiles, 5th and 95th percentiles, maximum, range, and IQR), and descriptive statistics (mean, standard deviation, variance, sum, skewness, kurtosis, and coefficient of variation). The key figures are condensed below.

Variable     Type         Distinct  Missing (%)  Mean       Min         Max         Std Dev    Skewness
Date         numerical    -         -            -          -           -           1139.17    0.17
Address      categorical  806       0.0          -          -           -           -          -
Bedrooms     categorical  9         0.0          -          -           -           -          -
Baths        numerical    16        0.0          1.78       0           6           0.74       1.00
SquareFeet   numerical    526       1.9          1583.22    640         6815        655.67     2.12
LotSize      numerical    398       20.2         0.72       0.029       55          2.73       13.09
YearBuilt    numerical    101       0.0          1945.90    1870        2013        37.21      -0.11
YearSold     numerical    11        0.0          2009.37    2005        2015        3.12       0.17
MonthSold    numerical    12        0.0          6.83       1           12          3.04       -0.08
DaySold      numerical    31        0.0          16.23      1           31          9.40       -0.09
CostPerSqFt  numerical    856       0.0          83.29      0           258.79      35.76      0.14
OrigPrice    numerical    385       0.0          146047.08  5990        695000      85174.11   1.87
ListPrice    numerical    373       0.0          141399.14  10500       695000      82228.20   1.85
SalePrice    numerical    405       0.0          133203.72  7000        606000      77959.62   1.70
SPLPPct      numerical    593       0.0          93.56      46.5        111.02      6.95       -2.50
datetime     datetime     709       0.0          -          2005-04-02  2015-04-09  -          -

A few points stand out. LotSize is missing for a fifth of sales (20.2%) and is extremely right-skewed (skewness 13.1), with a maximum of 55 against a median of 0.28. CostPerSqFt contains 18 zeros, which likely correspond to the 18 missing SquareFeet values. SPLPPct, the sale price as a percentage of the list price, is left-skewed with a median of around 95%, so most houses sold a little below their list price. Note too that Bedrooms has been inferred as categorical even though its values are all digits.

The report finishes with sections on interactions, correlations, and missing values.
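Reports like the one above come from dedicated packages, but the headline per-column numbers are straightforward to reproduce directly in pandas. A minimal sketch, using a toy dataframe (values loosely based on the first few housing rows) in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (illustration only)
df = pd.DataFrame(
    {
        "SquareFeet": [1120.0, 1224.0, 1540.0, np.nan, 1277.0],
        "SalePrice": [7000, 27000, 28000, 30000, 30750],
    }
)

# Headline per-column numbers in the style of a profiling report
summary = pd.DataFrame(
    {
        "distinct": df.nunique(),
        "missing_pct": 100 * df.isna().mean(),
        "mean": df.mean(),
        "min": df.min(),
        "max": df.max(),
        "skewness": df.skew(),
    }
)
print(summary.round(2))
```

The point of the dedicated report tools is that they bundle all of this, plus plots and correlations, into a single call; the pandas route is handy when you only want a couple of these numbers.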

3.5. Review

That’s it for this very brief tour of exploratory data analysis with code. You should now know how to do a simple exploratory analysis of a pandas dataframe: creating summary statistics using techniques from other chapters, styling tables of those summary statistics, and using the quick visualisation options that are built into pandas. You should also know where to go for more in-depth, heavy-duty exploratory data analysis reports, namely pandas-profiling and dataprep.
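As a reminder of the quick built-in visualisation options, every pandas dataframe and series carries a .plot accessor. A minimal sketch, with toy data standing in for the housing dataframe and rendered off-screen:

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import pandas as pd

# Toy stand-in for the housing data (illustration only)
df = pd.DataFrame(
    {
        "SquareFeet": [1120, 1224, 1540, 1154, 1277],
        "SalePrice": [7000, 27000, 28000, 30000, 30750],
    }
)

# Built-in pandas plotting: .plot.scatter, .plot.hist, .plot.box, etc.
ax = df.plot.scatter(x="SquareFeet", y="SalePrice")
ax.figure.savefig("sqft_vs_price.png")
```

These one-liners are no substitute for carefully designed charts, but for a first look at a dataset they are usually all you need.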